Final Project

August 11, 2024
Prepared by Yen-Ming (Max) Chen

Introduction

Everyone loves trees; they are essential for our environment and planet, and they are nice to look at. To gain a better understanding of the local distribution and statistics of trees around the greater Vancouver area, we will conduct an analysis and put together a report using a subset of the "Vancouver Street Trees" dataset. The dataset is provided by the City of Vancouver, which compiled and owns it, and was shortened and cleaned up (to an extent) by the faculty of UBC's "Intro to Data Visualization" course.

The report aims to answer the following questions, which center on the diameter, height range, and genus of trees, as well as whether they are planted on or off the curb, across all neighbourhoods.

Questions of Interest

  1. Which neighbourhood has the most trees planted? And how many of those trees were planted on/off the curb?
  2. What is the distribution, including quartiles, median, min/max, and outliers, of the height range and diameter in each neighbourhood?
  3. What are the numbers of trees of all genera in each neighbourhood?
  4. What are the diameter and height range for all trees in all neighbourhoods and of all genera?

In the analysis that follows, we will focus on dimension-related columns, such as height, diameter, and count, as well as genus and neighbourhood names. The report will start off with a macro view of all trees and neighbourhoods before focusing on the distributions of different categories across all neighbourhoods. Once all plots have been assembled, we will put together a dashboard of all graphs at the end of the report!

Analysis

Data & Library Imports

Dataset Description

As evident in the output above, the "Vancouver Street Trees" dataset (now known as trees_df) contains 21 columns and 5,000 rows. Of the 21 columns, three contain null values ('date_planted', 'cultivar_name', and 'plant_area'), while two others ('Unnamed: 0' and 'assigned') hold entries whose meaning is ambiguous; all five will be filtered out in the upcoming data wrangling section.
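The shape and null-value checks described above can be sketched roughly as follows. This is a minimal illustration on a tiny made-up frame (sample_df and every value in it are hypothetical stand-ins, not rows from trees_df); only the five column names quoted in the text come from the report, and 'diameter' is an assumed raw column name.

```python
import pandas as pd

# Hypothetical stand-in for trees_df; the real frame has 21 columns and 5,000 rows.
sample_df = pd.DataFrame(
    {
        "Unnamed: 0": [10, 11, 12, 13],
        "assigned": ["N", "N", "Y", "N"],
        "date_planted": ["2001-03-14", None, None, "1998-11-02"],
        "cultivar_name": [None, "KWANZAN", None, None],
        "plant_area": ["8", None, "B", "10"],
        "diameter": [12.0, 3.0, 28.5, 9.0],  # assumed raw column name
    }
)

# Number of rows and columns in the frame.
n_rows, n_cols = sample_df.shape

# Count missing entries per column, then keep only columns that have any.
null_counts = sample_df.isna().sum()
cols_with_nulls = null_counts[null_counts > 0].index.tolist()

print(f"{n_rows} rows x {n_cols} columns")
print(null_counts[cols_with_nulls])
```

Running the same two checks on the real trees_df is what flags 'date_planted', 'cultivar_name', and 'plant_area' as the columns to deal with.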

Data Wrangling

To make the plots easier to read and save ourselves the trouble of retitling axes again and again, let's rename all columns to a reader-friendly format!

As great as it would have been to include 'date_planted' in our analysis, the significant number of null values in that column means it is best to exclude it. With the data set now wrangled, it is time to move on to the next part of our report: answering the previously proposed questions!
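As a rough sketch of this wrangling step, the snippet below drops the unwanted columns and renames the rest in one pass. The raw names 'neighbourhood_name', 'genus_name', 'height_range_id', 'diameter', and 'curb' are assumptions about the source columns, the friendly names are modelled on the labels used later in this report, and raw_df holds invented rows.

```python
import pandas as pd

# Hypothetical two-row stand-in for the raw data (values are made up).
raw_df = pd.DataFrame(
    {
        "Unnamed: 0": [10, 11],
        "assigned": ["N", "Y"],
        "date_planted": [None, "1998-11-02"],
        "neighbourhood_name": ["SUNSET", "DOWNTOWN"],
        "genus_name": ["ACER", "PRUNUS"],
        "height_range_id": [2, 4],
        "diameter": [12.0, 9.0],
        "curb": ["Y", "N"],
    }
)

# Columns with ambiguous meaning or too many nulls to be useful.
drop_cols = ["Unnamed: 0", "assigned", "date_planted"]

# Assumed mapping from raw names to reader-friendly titles.
column_names = {
    "neighbourhood_name": "Neighbourhood",
    "genus_name": "Genus",
    "height_range_id": "Height",
    "diameter": "Diameter",
    "curb": "Curb",
}

clean_df = raw_df.drop(columns=drop_cols).rename(columns=column_names)
print(clean_df.columns.tolist())
```

Renaming once at the data level means every later chart picks up the readable titles without per-axis overrides.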

Question 01: Which neighbourhood has the most trees planted? And how many of those trees were planted on/off the curb?

A bar chart is the best fit for this question, as bars are easy to compare both within one chart and across charts. Figure 01 will consist of stacked bars without any color differentiation for the 'Curb' category; this gives the best overall look at the total number of trees in each neighbourhood.
Figure 02, on the other hand, will feature unstacked bars to communicate more effectively the difference between the numbers of trees planted on and off the curb in each neighbourhood. Both plots will feature a 'clickable' selection to provide more visual clarity for the readers/audience. Additionally, a tooltip will be added to indicate the exact count for each bar.
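The numbers behind the two bar charts can be reproduced without any plotting. The sketch below is an illustration on invented rows (toy_df, the neighbourhood labels, and the 'Y'/'N' coding of 'Curb' are assumptions): the first tally mirrors the bar heights in Figure 01, and the cross-tabulation mirrors the on/off split in Figure 02.

```python
import pandas as pd

# Hypothetical rows; 'Y'/'N' as the curb coding is an assumption.
toy_df = pd.DataFrame(
    {
        "Neighbourhood": ["Sunset", "Sunset", "Sunset", "Downtown", "Downtown", "Kitsilano"],
        "Curb": ["Y", "Y", "N", "Y", "N", "Y"],
    }
)

# Figure 01 idea: total number of trees per neighbourhood.
totals = toy_df["Neighbourhood"].value_counts()

# Figure 02 idea: the same totals split into off-curb (N) and on-curb (Y).
by_curb = pd.crosstab(toy_df["Neighbourhood"], toy_df["Curb"])

print(totals)
print(by_curb)
```

The tooltip in each figure reports exactly these counts, so a table like by_curb is a handy cross-check on what the bars show.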

Based on the output of both Figure 01 and 02, we can see that Renfrew-Collingwood not only has the title of "most trees planted on the curb" but also "most trees planted overall" when compared to all other neighbourhoods. Dunbar-Southlands takes the crown when examining only the trees planted off the curb.

Question 02: What is the distribution, including quartiles, median, min/max, and outliers, of the height range and diameter in each neighbourhood?

Since we wish to look more closely at the statistics and distribution of trees in all neighbourhoods, it is best to graph a boxplot. To answer Question 02, we will put 'Neighbourhood' on the Y-axis and repeat the plot with 'Height range' and 'Diameter' on the X-axis. Additionally, we will add a tooltip for the X-axis columns to get a better gauge of the specific value of each outlier!
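For readers who want the numbers a boxplot encodes, here is a minimal sketch that computes the quartiles per neighbourhood and counts outliers. It assumes the common 1.5 × IQR rule for flagging outliers, and both toy_df and the helper box_stats are hypothetical names with invented values.

```python
import pandas as pd

# Hypothetical diameters for two neighbourhoods (values are made up).
toy_df = pd.DataFrame(
    {
        "Neighbourhood": ["Sunset"] * 5 + ["Downtown"] * 5,
        "Diameter": [4.0, 8.0, 10.0, 12.0, 40.0, 3.0, 4.0, 6.0, 8.0, 10.0],
    }
)


def box_stats(values: pd.Series) -> dict:
    """Return quartiles and an outlier count, assuming the 1.5 * IQR rule."""
    q1 = values.quantile(0.25)
    median = values.quantile(0.5)
    q3 = values.quantile(0.75)
    iqr = q3 - q1
    lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    n_outliers = int(((values < lower) | (values > upper)).sum())
    return {"q1": q1, "median": median, "q3": q3, "outliers": n_outliers}


# One row of statistics per neighbourhood.
rows = {name: box_stats(group) for name, group in toy_df.groupby("Neighbourhood")["Diameter"]}
stats = pd.DataFrame.from_dict(rows, orient="index")
print(stats)
```

In this toy data, the single 40 in Sunset lies above q3 + 1.5 × IQR (12 + 6 = 18), so it is the one value that would be drawn as an outlier point.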

Looking at the Height graph, the medians of all neighbourhoods seem to hover around 2 to 3, and most, except for Sunset, share a minimum of 1; the values are this coarse because the entries are height ranges rather than exact measurements.
Shifting our attention to the Diameter graph, the median for the majority of neighbourhoods seems to hover around 10, but the minimum value varies more, ranging from 0 to 3. The outliers in the Diameter graph are substantially more numerous and more varied!

Question 03: What are the numbers of trees of all genera in each neighbourhood?

To adequately plot all genera in all neighbourhoods, it is best to put together a scatterplot with 'Neighbourhood' on the X-axis and 'Genus' on the Y-axis; such a graph will showcase the total count of trees. The 'count()' aggregate will be mapped to the color and size channels to make differences in quantity easier to see. Additionally, a tooltip will be included to show the exact count of records, and a dropdown menu will give the readers/audience the ability to filter which 'Neighbourhood' to focus on.
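Each circle in such a scatterplot corresponds to one (neighbourhood, genus) pair and its record count. The sketch below builds that table on invented rows and then mimics the dropdown by keeping a single neighbourhood; toy_df and its values are hypothetical.

```python
import pandas as pd

# Hypothetical rows (values are made up).
toy_df = pd.DataFrame(
    {
        "Neighbourhood": ["Sunset", "Sunset", "Sunset", "Downtown", "Downtown", "Downtown"],
        "Genus": ["Acer", "Acer", "Prunus", "Prunus", "Prunus", "Fraxinus"],
    }
)

# One row per (Neighbourhood, Genus) pair: the value behind count().
counts = (
    toy_df.groupby(["Neighbourhood", "Genus"])
    .size()
    .reset_index(name="count")
)

# Equivalent of picking 'Sunset' in the dropdown menu.
selected = counts[counts["Neighbourhood"] == "Sunset"]

print(counts)
print(selected)
```

Pairs that never occur simply have no row, which is why sparse genera leave empty cells in the plot rather than zero-sized circles.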

Looking at the output of Figure 04, the genera 'Acer' and 'Prunus' are the most common in most neighbourhoods. Several genera, such as 'Koelreuteria' and 'Ilex', are absent from most neighbourhoods (and rare overall). 'Fraxinus', 'Carpinus', and a few others sit more comfortably in the middle, but the gap between them and the two reigning genera is still quite significant.

Question 04: What are the diameter and height range for all trees in all neighbourhoods and of all genera?

Similar to Question 03, we will also be plotting a scatterplot, except there will be more information and columns involved. 'Diameter' will be set to the X-axis and 'Neighbourhood' will be set to the Y-axis. The graph will also include a slider for the 'Height' column, with additional parameters of set=1 & max=9 to contain the slider values properly, and tooltip to communicate the exact values of 'Height', 'Diameter', 'count()', and 'genus'.

The majority of trees seem to lie within the diameter range of 2 to 36; above that range, the number of trees falls off drastically. There are also quite a few outliers, especially the ones over 45 in diameter (roughly a dozen). By using the 'Height' slider, it is evident that the higher the value (the taller the trees), the fewer the trees. Once the slider goes above 6, the number of circles diminishes quite substantially. Vancouver does not seem to have a lot of tall trees.
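The slider readings above can also be checked numerically. As a minimal sketch on invented rows (toy_df and its values are hypothetical, so the output illustrates the method rather than the real result), we can tally trees per height range, summarize their diameters, and filter on one height value the way the slider does.

```python
import pandas as pd

# Hypothetical height ranges and diameters (values are made up).
toy_df = pd.DataFrame(
    {
        "Height": [1, 1, 2, 2, 2, 4, 4, 7],
        "Diameter": [3.0, 4.0, 6.0, 8.0, 10.0, 14.0, 18.0, 30.0],
    }
)

# Number of trees and median diameter for each height range.
summary = toy_df.groupby("Height")["Diameter"].agg(["count", "median"])

# Equivalent of moving the 'Height' slider to 7.
at_height_7 = toy_df[toy_df["Height"] == 7]

print(summary)
print(len(at_height_7), "tree(s) at height range 7")
```

A table like summary shows at a glance both how quickly the counts thin out at the upper height ranges and how the typical diameter changes from one range to the next.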

Discussion

Vancouver has always been known for its outdoor lifestyle and nature-rich environment, so some of the answers I found were not all that surprising; however, some did manage to catch me off guard. Let's break it down from the beginning!

In Figure 01, we can see that Renfrew-Collingwood takes the crown for having the most trees planted in total. More than half of all neighbourhoods sit at around 150 to 300 trees, while Strathcona unfortunately has the fewest, with only 75. In Figure 02, it is not a surprise to learn that growing on or off the curb makes a difference, but what is surprising to me is how drastic the difference is! Dunbar-Southlands takes home the title of having the most trees planted off the curb at 63, but that figure is dwarfed by the 250 trees the neighbourhood has growing on the curb; there are about four times as many trees planted on the curb as off it! All other neighbourhoods share similar results, with the ratio being even more pronounced!

Moving on to Figure 03, the boxplot conveys a lot of information to us, for both the 'Height' and 'Diameter' distributions. Let's first look at the 'Height' boxplot. The median value for height range is around 2 to 3, while the third quartile most commonly sits at 4. Only five neighbourhoods don't have an outlier, but two of them have a maximum of 9, which is also the maximum height range in the data set. Shifting our focus now to 'Diameter', the statistics vary significantly more from one neighbourhood to the next. The median value seems to hover around 10, with greater variation than in the 'Height' boxplot. All neighbourhoods have a minimum lower than 10 and widely differing maximum values. Downtown seems to have the narrowest spread, with its first quartile at 4 and its third quartile sitting exactly at 10; such a statistic makes perfect sense once you consider how limited and crowded the downtown area is, which does not allow many trees to grow bigger and wider.

What astonishes me about Figure 04 is how dominant certain genera are when compared to others. 'Acer' and 'Prunus' completely overshadow all other genera in all neighbourhoods. Kensington-Cedar Cottage has the most 'Acer' at 90, while Victoria-Fraserview possesses the largest quantity of 'Prunus' at 94. About a third of all genera, such as 'Chitalpa' and 'Nyssa' (to name a few), have very few entries in most neighbourhoods. Not being a botanist myself, I find it difficult to ascertain how certain genera gain their dominance in Vancouver, but putting aside the potential reasons why, the stats paint a stark picture of a winner-take-all situation and a highly imbalanced distribution of genera.

Lastly, looking at Figure 05, perhaps the most complex graph of this report, we can see most trees fall within the diameter range of 2 to 36. As we shift the slider to the right, increasing the height, the diameters grow and most trees become thicker; putting that into the context of our reality, a tall tree is more likely to require a thicker trunk to sustain its height. While not every tall tree has a wide diameter, the overall distribution, based on the visual changes from the slider selection, supports that assumption. Interestingly, the two trees with the widest diameter of 71 only have a height range of 5 (but then again, that's why they are outliers!).

Originally, I intended to explore how tree diameters and height ranges evolved over time, but due to the number of null values in the temporal column ('date_planted'), I had to shift focus onto examining all neighbourhoods instead; questions such as whether trees have gotten thicker or taller were scrapped as a result. Another gripe I have with this data set is its definition of 'Height range'. Without exact values, we lose a lot of detail in our analysis, and our ability to create more sophisticated graphs becomes limited.

Despite the limitations of the data set, the report and dashboard have been quite satisfying for me to put together, and the results have been welcome as well (both for confirming my own preconceived notions and for surprising me in some respects). I hope you have enjoyed reading this analysis as much as I have enjoyed compiling it!

Dashboard

The following dashboard comprises all the graphs that I have made in this report. Figures 01 and 02 will be concatenated horizontally, as they are designed to be viewed with and against each other for better comparison, while Figures 03, 04, and 05 will be stacked beneath them vertically. Figure 05 is designed to be the ultimate combination of all the graphs that came before it; therefore, it is situated at the very bottom of the report. One improvement that I would have liked to make, had I more time and the technical know-how, is to attach the Figure 04 dropdown menu and Figure 05 slider beneath their respective graphs. Having the two selection tools clustered at the bottom and out of order just seems messy to me. But the dashboard, as it stands, serves the purpose of answering all of my proposed questions visually and allows the reader/audience to interact with it with ease and clarity, so it's good enough!

References

  1. Introduction to Data Visualization. University of British Columbia. https://canvas.ubc.ca/course/140055
  2. Vancouver Street Trees. City of Vancouver. https://opendata.vancouver.ca/explore/dataset/street-trees/information/?disjunctive.species_name&disjunctive.common_name&disjunctive.height_range_id&disjunctive.on_street&disjunctive.neighbourhood_name